Conversation
Summary of Changes

Hello @jaytiwarihub, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request lays the groundwork for the Qwen2.5-VL model, integrating vision processing capabilities into the existing Qwen language model. It includes the core components for encoding visual information, projecting it into the text embedding space, and fusing it with text embeddings for multimodal processing. The changes also include modifications to the preprocessor to handle text-only inputs gracefully.
Activity
Code Review
This pull request introduces the initial skeleton for the Qwen2.5-VL model, including the backbone, vision encoder, and projector components, and refactors the Gemma3 preprocessor. While this is a good start, several areas need attention to align with repository standards: new backbones should use the Keras Functional API, docstrings and get_config methods are missing, and there are implementation issues in Qwen2VLBackbone and Qwen2VLVisionBlock. A minor code duplication issue was also found in gemma3_causal_lm_preprocessor.py.
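For context on the missing `get_config` methods, here is a minimal sketch of the pattern; plain Python stands in for a `keras.Model` subclass here, and in real KerasHub code `get_config` would start from `super().get_config()` and update it:

```python
class ToyBackbone:
    """Sketch of the get_config pattern (hypothetical class, not the PR's code)."""

    def __init__(self, hidden_dim, num_layers):
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

    def get_config(self):
        # Record every constructor argument so the object can be
        # rebuilt as cls(**config) during deserialization.
        return {"hidden_dim": self.hidden_dim, "num_layers": self.num_layers}


config = ToyBackbone(hidden_dim=64, num_layers=2).get_config()
restored = ToyBackbone(**config)
print(restored.hidden_dim)  # 64
```

The key design point is that the config round-trips: everything passed to `__init__` must come back out of `get_config`.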
```python
combined_embeddings = keras.ops.concatenate(
    [image_embeddings, text_embeddings], axis=1
)

# Pass through the LLM
x = self.text_backbone.transformer_layers(combined_embeddings)
```
self.text_backbone.transformer_layers is a list of layers, not a single callable layer. This will raise an error. You need to iterate through the layers in a loop.
Additionally, this approach of accessing internal layers of self.text_backbone breaks encapsulation and is brittle. It would be better to either reuse the text_backbone's call method or restructure the model. The padding_mask input is also missing from the call to the transformer layers.
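Concretely, since `transformer_layers` is a Python list, the hidden state has to be threaded through it in a loop. A minimal sketch (plain functions stand in for real transformer layers, which would also take the `padding_mask`):

```python
def run_transformer_layers(transformer_layers, x):
    # A list of layers is not callable itself; call each layer in
    # turn, feeding the output of one as the input of the next.
    for layer in transformer_layers:
        x = layer(x)
    return x


# Stand-ins for transformer layers: each doubles its input.
layers = [lambda t: t * 2 for _ in range(3)]
print(run_transformer_layers(layers, 1))  # 1 -> 2 -> 4 -> 8
```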
```python
# We will squeeze these back at the end.
batched = True

batched = True
```
```python
import keras

from keras_hub.src.models.backbone import Backbone
from keras_hub.src.models.qwen.qwen_backbone import QwenBackbone
```

```python
def call(self, x, grid_thw=None):
    # x shape: (Batch, Time, Height, Width, Channels)
    x = self.patch_embed(x)
```
Note to maintainers: Please update the PR title to [Model] Add Qwen2-VL Model Architecture and Preprocessing (I cannot see the edit button on my end).
```python
from keras_hub.src.models.qwen2_vl.qwen2_vl_causal_lm import Qwen2VLCausalLM
from keras_hub.src.models.qwen2_vl.qwen2_vl_projector import Qwen2VLProjector
from keras_hub.src.models.qwen2_vl.qwen2_vl_vision_encoder import (
    Qwen2VLVisionEncoder,
)
```
Suggested change:

```python
from keras_hub.src.models.qwen2_vl.qwen2_vl_causal_lm import Qwen2VLCausalLM
from keras_hub.src.models.qwen2_vl.qwen2_vl_projector import Qwen2VLProjector
from keras_hub.src.models.qwen2_vl.qwen2_vl_vision_encoder import (
    Qwen2VLVisionEncoder,
)
from keras_hub.src.models.qwen2_vl.qwen2_vl_presets import backbone_presets
from keras_hub.src.utils.preset_utils import register_presets

register_presets(backbone_presets, Qwen2VLBackbone)
```
The other imports inside the `__init__` are unnecessary and go against repo standards.
```python
"""Qwen2-VL Backbone model.

This backbone combines the Vision Encoder and the Text Backbone.
It follows the KerasHub Functional API pattern.
```
This line reads oddly; better to remove it and instead document the parameters/args.
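For instance, the docstring could follow the usual KerasHub shape, replacing the narrative line with an `Args:` section (the argument names below are illustrative, not the actual signature):

```python
class Qwen2VLBackbone:
    """Qwen2-VL backbone combining a vision encoder and a text backbone.

    Args:
        vision_encoder: A vision encoder instance that maps images to
            patch features.
        text_backbone: A `QwenBackbone` instance used for the language
            modeling stack.
        dtype: Optional dtype or dtype policy for the layer weights.
    """
```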
```python
# Copyright 2024 The KerasHub Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```
Suggested change: delete this license header.
Why is this here when it does not exist in any other model's backbone file?
```python
# Copyright 2024 The KerasHub Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```
Suggested change: delete this license header.
This is not present in any other model's `__init__` file; curious as to why you put it here?
@jaytiwarihub, as you have mentioned the implementation is about Qwen2.5-VL, could you please rename the directories accordingly? All the communication has been made clear in these issue threads, and the respective contributors working on them have been assigned to those issues as well. Hope this clarifies all the confusion.
@samudraneel05 thanks for the help, I appreciate it. Those weird lines were just me feeling overwhelmed; I'll take care of them.
sachinprasadhs
left a comment
Thanks for the PR.
I just went through the PR at a high level; the code looks incomplete and also does not follow the Keras Hub design principles and guidelines.
Please refer to https://github.com/keras-team/keras-hub/blob/master/CONTRIBUTING_MODELS.md for details.
@sachinprasadhs thank you for your kind review! I'm working on it.
PR Title
[Model] Add Qwen2-VL Model Architecture and Preprocessing

PR Description
What does this PR do?
This PR implements the Qwen2-VL (Qwen2 Vision-Language) model architecture in Keras 3. Qwen2-VL is a state-of-the-art multimodal model that introduces "Naive Dynamic Resolution" support, allowing it to process images of arbitrary aspect ratios by converting them into dynamic grids.
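The "Naive Dynamic Resolution" idea can be illustrated with a simplified helper that snaps an image's dimensions to the nearest multiple of the patch size (a sketch only; the actual Qwen2-VL smart resizing also enforces minimum and maximum pixel budgets):

```python
def snap_to_grid(height, width, patch_size=14):
    # Round each dimension to the nearest multiple of `patch_size`
    # (at least one patch), so the image tiles exactly into
    # patch_size x patch_size patches with no padding.
    new_h = max(patch_size, round(height / patch_size) * patch_size)
    new_w = max(patch_size, round(width / patch_size) * patch_size)
    return new_h, new_w


print(snap_to_grid(500, 333))  # (504, 336)
```

Any aspect ratio is accepted; only the pixel dimensions shift slightly so the image divides evenly into patches for the vision transformer.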
Key Components Implemented:
- Qwen2VLVisionEncoder: A 3D Vision Transformer backbone that supports 3D convolution patch embeddings and 3D Rotary Positional Embeddings (RoPE) to handle video and dynamic image inputs.
- Qwen2VLImageConverter: A preprocessing layer that implements the "Smart Resizing" logic, resizing images to optimal grid sizes based on the patch size (14x14) to minimize padding.
- Qwen2VLProjector: A lightweight MLP adapter that projects visual features into the LLM's embedding space.
- Qwen2VLCausalLM: The end-to-end model class connecting the vision tower with the Qwen2 text backbone.
- Qwen2VLCausalLMPreprocessor: The high-level preprocessor handling text tokenization and image tensor conversion.

Technical Details:
- MultiHeadAttention.

Tests:
- Unit tests (backbone_test, projector_test, image_converter_test).
- Integration test (integration_test.py) verifying the end-to-end flow from raw text/image input to preprocessed tensors.

Reference: